
Initial LMBuddy class for running jobs #84

Merged · 17 commits into dev/RD2024-147/buddy-class from sfriedowitz/buddy-job-runner on Mar 18, 2024

Conversation

@sfriedowitz (Contributor, Author) commented Mar 14, 2024

What's changing

  • Removes the run_job method in favor of an LMBuddy class with finetune and evaluate methods (sketched below)
  • Implements job-result data structures for returning data generated by job entrypoints, instead of simply having the entrypoints write to disk/W&B
  • Implements a new LoadableAssetPath type and associated data structures to represent any load_from path for a HF asset. See inline comments for motivation for this change.

Note that the CLI is unchanged by these internal refactors, so you can still execute the package as a Ray entrypoint in the same manner as before.
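Below is a minimal sketch of how the new interface might be used, based on the description above; the method names come from the bullets, but the exact signatures and the shape of the returned result objects are assumptions, not the final API.

from lm_buddy import LMBuddy
from lm_buddy.jobs.configs import FinetuningJobConfig, LMHarnessJobConfig


def run_jobs(finetuning_config: FinetuningJobConfig, lm_harness_config: LMHarnessJobConfig):
    # A single LMBuddy instance exposes one method per job type, replacing run_job.
    buddy = LMBuddy()

    # Each call returns a job-result data structure instead of only writing
    # outputs to disk or W&B (the exact result fields are not shown here).
    finetune_result = buddy.finetune(finetuning_config)
    eval_result = buddy.evaluate(lm_harness_config)
    return finetune_result, eval_result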

How to test it

  • Run the test suite
  • Pull the branch and try running some jobs with the new interface

Related Jira Ticket

Additional notes for reviewers

In follow-up PRs into this dev branch, I would like to do the following:

@sfriedowitz sfriedowitz changed the base branch from main to dev/RD2024-147/buddy-class March 14, 2024 19:54
@sfriedowitz sfriedowitz changed the title Buddy class for running jobs Initial LMBuddy class for running jobs Mar 14, 2024
@sfriedowitz (Contributor, Author) commented on a review thread:

This is a duplicate of the other hf_config.yaml file since it already has the quantization section specified.

)
print("Logging artifact for model checkpoint...")
artifact_loader.log_artifact(model_artifact)
ckpt_path, artifact_config = None, None
@sfriedowitz (Contributor, Author) commented on a review thread:

In a follow-up PR (https://mzai.atlassian.net/browse/RD2024-152), I would like to refactor a bit how we are generating artifacts and results in these methods.

The issue is that because the tracking field is optional, we repeatedly end up having to write two code branches, (1) for when we initialize a W&B run and create an artifact and (2) for when we run the job without tracking or artifacts. It's almost like we want something like maybe_initialize_wandb_run that handles the optionality of the tracking service.
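A rough sketch of the kind of helper being described here, purely illustrative and not part of this PR; it assumes the tracking config exposes name/project/entity fields as in the WandbRunConfig example further down.

from contextlib import contextmanager

import wandb


@contextmanager
def maybe_initialize_wandb_run(tracking_config=None):
    """Yield an active W&B run when tracking is configured, otherwise None."""
    if tracking_config is None:
        # No tracking requested: the job runs without a W&B run or artifacts.
        yield None
        return
    run = wandb.init(
        name=tracking_config.name,
        project=tracking_config.project,
        entity=tracking_config.entity,
    )
    try:
        yield run
    finally:
        run.finish()


# Inside a job method, the two branches collapse into one code path:
# with maybe_initialize_wandb_run(config.tracking) as run:
#     ... run the job ...
#     if run is not None:
#         run.log_artifact(model_artifact)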

Resolved review thread (outdated): src/lm_buddy/buddy.py
@sfriedowitz sfriedowitz marked this pull request as ready for review March 14, 2024 21:48
@veekaybee (Member) commented:
Looking now! Pulled the branch, read the instructions for direct_job_execution.ipynb, and put together a script like this. Where do we specify the cluster information in the new workflow? Seems like it should be somewhere here, right? FinetuningRayConfig?

from ray.job_submission import JobSubmissionClient
from pathlib import Path


from lm_buddy import LMBuddy
from lm_buddy.jobs.configs import (
    FinetuningJobConfig,
    FinetuningRayConfig,
    LMHarnessJobConfig,
    LMHarnessEvaluationConfig,
)
from lm_buddy.integrations.huggingface import (
    AutoModelConfig,
    TextDatasetConfig,
    TrainerConfig,
    AdapterConfig,
)
from lm_buddy.integrations.wandb import WandbRunConfig

# Base model to finetune from HuggingFace
model_config = AutoModelConfig(load_from="distilgpt2")

# Text dataset for finetuning
dataset_config = TextDatasetConfig(
    load_from="imdb",
    split="train[:100]",
    text_field="text",
)

# HuggingFace trainer arguments
trainer_config = TrainerConfig(
    max_seq_length=256,
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=1,
    logging_strategy="steps",
    logging_steps=1,
    save_strategy="epoch",
    save_steps=1,
)

# LORA adapter settings
adapter_config = AdapterConfig(
    peft_type="LORA",
    task_type="CAUSAL_LM",
    r=8,
    lora_alpha=16,
    lora_dropout=0.2,
)

# Define tracking for finetuning run
tracking_config = WandbRunConfig(
    name="example-finetuning",
    project="lm-buddy-examples",  # Update to your project name
    entity="mozilla-ai",  # Update to your entity name
)

# Ray train settings
ray_config = FinetuningRayConfig(
    use_gpu=False,  # Change to True if GPUs are available on your machine
    num_workers=2,
)

# Full finetuning config
finetuning_config = FinetuningJobConfig(
    model=model_config,
    dataset=dataset_config,
    trainer=trainer_config,
    adapter=adapter_config,
    tracking=tracking_config,
    ray=ray_config,
)

@sfriedowitz (Contributor, Author) replied:
Where do we specify the cluster information in the new workflow?

Nothing is changing in how you specify the cluster information. The CLI of the package is unchanged, so you can use the same commands as the entrypoint for a Ray job submission via the Ray SDK, as sketched below.
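For reference, a hedged sketch of that flow using the Ray job submission SDK; the cluster address and the CLI entrypoint string are placeholders, not commands defined by this PR.

from ray.job_submission import JobSubmissionClient

# Address of the Ray cluster's job submission server (placeholder).
client = JobSubmissionClient("http://127.0.0.1:8265")

client.submit_job(
    # Placeholder entrypoint: use the same CLI command you ran before this PR.
    entrypoint="python -m lm_buddy finetune --config finetuning_config.yaml",
    runtime_env={"working_dir": "."},
)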

Resolved review threads (outdated): src/lm_buddy/buddy.py, src/lm_buddy/cli/run.py, src/lm_buddy/paths.py (two threads)
@veekaybee (Member) commented Mar 15, 2024:

Tested and left some comments; unit tests pass and the sample job works!

@sfriedowitz (Contributor, Author) replied:

Thanks! I'm a bit sidetracked at the moment, but will address most of them in the next few hours.

@veekaybee (Member) left a review comment:

Thanks for addressing! LGTM

@sfriedowitz sfriedowitz merged commit 4c08a29 into dev/RD2024-147/buddy-class Mar 18, 2024
3 checks passed
@sfriedowitz sfriedowitz deleted the sfriedowitz/buddy-job-runner branch March 18, 2024 18:24